fix(datasets): Add shuffle option to IidPartitioner #7385

Open: WilliamLindskog wants to merge 7 commits into main from fix/flwr-datasets-iid-metrics


WilliamLindskog commented Jun 15, 2026

Member

What changed

  • Add optional shuffle and seed parameters to IidPartitioner
  • Preserve the existing contiguous-slice behavior by default (shuffle=False)
  • Cache the shuffled dataset per partitioner instance so repeated partition loads stay stable, including when seed=None
  • Document the sorted-local-dataset case where shuffling before IID partitioning avoids label-skewed partitions
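
The sorted-dataset case in the last bullet can be illustrated with a small pure-Python sketch. It does not use `flwr_datasets`; `iid_partitions` and `contiguous_partitions` are hypothetical helpers that only mimic the contiguous-slice behavior described above, on a toy label list whose length divides evenly by the number of partitions.

```python
import random
from collections import Counter


def contiguous_partitions(rows, num_partitions):
    """Split rows into equal contiguous slices.

    Toy assumption: len(rows) is divisible by num_partitions.
    """
    size = len(rows) // num_partitions
    return [rows[i * size:(i + 1) * size] for i in range(num_partitions)]


def iid_partitions(rows, num_partitions, shuffle=False, seed=None):
    """Optionally shuffle a copy of rows, then slice contiguously.

    Hypothetical stand-in for the behavior described in this PR,
    not the IidPartitioner implementation itself.
    """
    if shuffle:
        rows = list(rows)  # copy so the caller's list is left untouched
        random.Random(seed).shuffle(rows)
    return contiguous_partitions(rows, num_partitions)


# A label-sorted toy dataset: 50 rows of class 0 followed by 50 rows of class 1.
labels = [0] * 50 + [1] * 50

unshuffled = iid_partitions(labels, num_partitions=2)
shuffled = iid_partitions(labels, num_partitions=2, shuffle=True, seed=42)

unshuffled_counts = [Counter(p) for p in unshuffled]
shuffled_counts = [Counter(p) for p in shuffled]

print(unshuffled_counts)  # each partition holds a single class
print(shuffled_counts)    # each partition mixes both classes
```

Without shuffling, every partition of the sorted list sees exactly one class, which is the label skew the documentation change warns about. Reusing the same seed reproduces the same partitions.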

Issue/PR mapping

Validation

  • pytest, ruff, mypy, and black --check on the touched IID partitioner files
  • git diff --check

Copilot AI review requested due to automatic review settings June 15, 2026 20:41

Copilot AI left a comment

Contributor

Pull request overview

This PR enhances flwr_datasets by (1) adding optional shuffling to IidPartitioner while preserving the historical contiguous-shard default behavior, and (2) introducing public partition skew distance metrics (Hellinger and Jensen–Shannon) to quantify how partition label/target distributions differ from the full dataset.

Changes:

  • Extend IidPartitioner with shuffle/seed and cache the shuffled dataset per instance for stable repeated loads.
  • Add compute_hellinger_distances and compute_jensen_shannon_distances (with optional binning for continuous targets) plus test coverage.
  • Update README and Sphinx docs to reflect the new IidPartitioner signature and new skew metrics.
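
For readers unfamiliar with the two skew metrics named above, here is a minimal sketch of how a Hellinger distance and a Jensen–Shannon distance between a partition's label distribution and the full dataset's could be computed. The helper names are hypothetical and are not the `compute_hellinger_distances` / `compute_jensen_shannon_distances` functions from this PR; the base-2 logarithm is an assumption chosen so that both distances fall in [0, 1].

```python
import math
from collections import Counter


def label_distribution(labels, classes):
    """Return the relative frequency of each class, in the order of `classes`."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in classes]


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


def kl_divergence(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base-2 logarithm)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return math.sqrt((kl_divergence(p, m) + kl_divergence(q, m)) / 2)


classes = [0, 1]
full = label_distribution([0] * 50 + [1] * 50, classes)  # balanced dataset
skewed = label_distribution([0] * 50, classes)           # single-class partition
balanced = label_distribution([0, 1] * 25, classes)      # well-mixed partition

print(hellinger_distance(skewed, full), jensen_shannon_distance(skewed, full))
print(hellinger_distance(balanced, full), jensen_shannon_distance(balanced, full))
```

A partition that matches the full dataset scores 0 on both metrics, while a single-class partition of a balanced two-class dataset scores strictly higher.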

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| datasets/README.md | Documents IidPartitioner(shuffle, seed) usage and introduces partition skew metrics in the library overview/quickstart. |
| datasets/flwr_datasets/partitioner/iid_partitioner.py | Adds shuffle/seed parameters and per-instance caching of the shuffled dataset used for sharding. |
| datasets/flwr_datasets/partitioner/iid_partitioner_test.py | Adds regression and determinism tests for default contiguous behavior and shuffled sharding. |
| datasets/flwr_datasets/metrics/utils.py | Implements Hellinger and Jensen–Shannon distance utilities (including optional binning) and related helpers. |
| datasets/flwr_datasets/metrics/utils_test.py | Adds unit tests validating distance values, binning behavior, and input validation. |
| datasets/flwr_datasets/metrics/__init__.py | Exposes the new metric functions as part of the public flwr_datasets.metrics API. |
| datasets/docs/source/index.rst | Updates feature list and IidPartitioner signature in docs landing page. |
| datasets/docs/source/how-to-use-with-local-data.rst | Adds guidance for shuffling sorted local datasets via IidPartitioner(shuffle=True, seed=...). |

WilliamLindskog changed the title from "fix(datasets): add IID shuffle and partition skew metrics" to "fix(datasets): Add IID shuffle and partition skew metrics" on Jun 15, 2026
github-actions bot added the Maintainer label on Jun 15, 2026
WilliamLindskog changed the title from "fix(datasets): Add IID shuffle and partition skew metrics" to "fix(datasets): Add shuffle option to IidPartitioner" on Jun 16, 2026

Member Author

Update: I split this PR to give it a lower-friction review path.

Focused validation passed for the touched IID partitioner files: pytest, ruff, mypy, black --check, and git diff --check.

WilliamLindskog marked this pull request as ready for review June 16, 2026 03:15
Comment thread datasets/docs/source/index.rst Outdated
Comment thread datasets/flwr_datasets/partitioner/iid_partitioner.py Outdated
        if not self._shuffle:
            return self.dataset
        if self._shuffled_dataset is None:
            self._shuffled_dataset = self.dataset.shuffle(seed=self._seed)

Member

Does this mean we have two copies of the dataset?

Member Author

Good question. Dataset.shuffle(...) returns another Hugging Face Dataset object with shuffled indices/cache metadata rather than eagerly duplicating all row data. So this keeps a second dataset object around, but it should not be a full in-memory copy of the underlying dataset. The cache here is intentional so repeated load_partition calls use the same shuffled order, especially when seed=None.
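
The two points in this reply (a shuffled view backed by an index list rather than a copy of the rows, and a per-instance cache that keeps the order stable when `seed=None`) can be sketched in plain Python. This is an illustration of the idea only: `ShuffledView` and `ToyPartitioner` are hypothetical classes and do not reproduce Hugging Face `Dataset` internals or the code in this PR.

```python
import random


class ShuffledView:
    """Read-only view that reorders rows through a permuted index list."""

    def __init__(self, rows, seed=None):
        self._rows = rows  # shared reference, the row data is not copied
        self._indices = list(range(len(rows)))
        # seed=None draws a fresh, unpredictable permutation on each construction
        random.Random(seed).shuffle(self._indices)

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, i):
        return self._rows[self._indices[i]]


class ToyPartitioner:
    """Hypothetical partitioner that caches its shuffled view per instance."""

    def __init__(self, rows, num_partitions, shuffle=False, seed=None):
        self._rows = rows
        self._num_partitions = num_partitions
        self._shuffle = shuffle
        self._seed = seed
        self._shuffled = None  # built lazily on first use

    def _source(self):
        if not self._shuffle:
            return self._rows
        if self._shuffled is None:
            # Built once, so every later load sees the same order
            self._shuffled = ShuffledView(self._rows, seed=self._seed)
        return self._shuffled

    def load_partition(self, partition_id):
        source = self._source()
        size = len(source) // self._num_partitions
        start = partition_id * size
        return [source[i] for i in range(start, start + size)]


rows = list(range(20))
partitioner = ToyPartitioner(rows, num_partitions=4, shuffle=True, seed=None)

first = partitioner.load_partition(0)
second = partitioner.load_partition(0)
print(first == second)  # repeated loads agree because the view is cached
```

Without the cache, a `seed=None` shuffle would produce a different permutation on each call, so partitions loaded at different times could overlap or miss rows.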

Comment thread datasets/README.md Outdated
Comment thread datasets/README.md Outdated
Labels

Maintainer: Used to determine what PRs (mainly) come from Flower maintainers.

Development

Successfully merging this pull request may close these issues.

[feature]: Support shuffling in IIDPartitioner or update documentation

3 participants